Aggressive assembly of pyrosequencing reads with mates

Authors

  • Jason R. Miller
  • Arthur L. Delcher
  • Sergey Koren
  • Eli Venter
  • Brian Walenz
  • Anushka Brownley
  • Justin Johnson
  • Kelvin Li
  • Clark M. Mobarry
  • Granger G. Sutton
Abstract

MOTIVATION: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data.

RESULTS: Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data.

AVAILABILITY: The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License.

Related articles

Annotation of metagenome short reads using proxygenes

MOTIVATION A typical metagenome dataset generated using a 454 pyrosequencing platform consists of short reads sampled from the collective genome of a microbial community. The amount of sequence in such datasets is usually insufficient for assembly, and traditional gene prediction cannot be applied to unassembled short reads. As a result, analysis of such datasets usually involves comparisons in...

Assembly of genomic reads of elite indica rice cultivar onto 2101 reference bacterial genomes for identification of co-sequenced endophytic bacteria

Reference based assembly of genomic reads of the elite indica rice cultivar RP Bio-226 was carried out against 2101 reference bacterial genomes using Bowtie-2 genome assembly tool. Five types of data: Number of paired end reads concordantly aligned exactly only once, number of paired end reads concordantly aligned more than once, number of mates that make the pairs aligned exactly only once, nu...

Sequencing the Bonobo Genome

The Bonobo Genome Consortium generated DNA sequencing reads representing the genome of a single bonobo individual. The data consisted of almost 270 million fragment sequences generated on FLX machines from 454 Life Sciences. The fragments derived from FLX standard and Titanium chemistries, and from paired and unpaired protocols. The data was assembled at the J. Craig Venter Institute with the o...

Assessment of Metagenomic Assembly Using Simulated Next Generation Sequencing Data

Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes...

Filtering duplicate reads from 454 pyrosequencing data

MOTIVATION Throughout the recent years, 454 pyrosequencing has emerged as an efficient alternative to traditional Sanger sequencing and is widely used in both de novo whole-genome sequencing and metagenomics. Especially the latter application is extremely sensitive to sequencing errors and artificially duplicated reads. Both are common in 454 pyrosequencing and can create a strong bias in the e...

Journal title:

Volume 24, Issue

Pages -

Publication date: 2008